from utils import *
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')
check_missing_columns(train)
id                False
industry           True
state             False
request_date      False
term              False
employee_count    False
business_new      False
business_type     False
location          False
other_loans       False
loan_amount       False
insured_amount    False
default_status    False
dtype: bool
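`check_missing_columns` is imported from `utils`, which isn't shown here. A minimal sketch of what it plausibly does, given the boolean-Series output above:

```python
import pandas as pd

def check_missing_columns(df: pd.DataFrame) -> pd.Series:
    # Return a boolean Series: True for each column containing at least one NaN
    return df.isna().any()

# Tiny illustration on a hypothetical frame
demo = pd.DataFrame({"industry": ["Hotel", None], "state": ["CA", "NY"]})
print(check_missing_columns(demo))
```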
The industry column in the train dataset contains missing values. Let's dive deeper to see how many records contain a missing industry value.
train[train.industry.isna()]
|  | id | industry | state | request_date | term | employee_count | business_new | business_type | location | other_loans | loan_amount | insured_amount | default_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1946 | 3771775001 | NaN | NH | 20-Nov-09 | 12 | 1 | New | 0 | Rural | N | $100.00 | $75,000.00 | 0 |
1 record contains a missing industry value. Let's drop that record.
train = train.dropna()
train.shape
(2401, 13)
Check if the test dataset contains any missing values in its columns.
check_missing_columns(test)
id                False
industry          False
state             False
request_date      False
term              False
employee_count    False
business_new      False
business_type     False
location          False
other_loans       False
loan_amount       False
insured_amount    False
dtype: bool
Test dataset does not contain any missing values.
Check if the test dataset contains categorical values that are not present in the train dataset
columns_missing_categories = check_categories(
test, train,
categorical_fields = ['industry', 'state', 'business_new', 'business_type', 'location', 'other_loans']
)
columns_missing_categories
Checking categorical column: industry
Unique categories in train: {'Hotel', 'Real Estate', 'Healthcare', 'Education', 'Finance', 'Agriculture', 'Transportation', 'Energy', 'Others', 'Trading', 'Entertainment', 'Consulting', 'Engineering', 'Manufacturing', 'Construction', 'Administration'}
Unique categories in test: {'Hotel', 'Real Estate', 'Healthcare', 'Education', 'Finance', 'Agriculture', 'Transportation', 'Energy', 'Others', 'Trading', 'Entertainment', 'Engineering', 'Manufacturing', 'Construction', 'Administration', 'Consulting'}
Categories in test but not train: set()
Checking categorical column: state
Unique categories in train: {'CT', 'MI', 'NH', 'NM', 'TX', 'IL', 'CA', 'HI', 'IA', 'LA', 'NE', 'WI', 'OR', 'GA', 'WV', 'UT', 'ID', 'DE', 'SD', 'PA', 'NC', 'VT', 'OH', 'NY', 'CO', 'MD', 'AL', 'WY', 'AR', 'TN', 'ME', 'MT', 'VA', 'NJ', 'OK', 'AZ', 'KS', 'NV', 'ND', 'IN', 'WA', 'MO', 'AK', 'SC', 'MN', 'MS', 'KY', 'MA', 'FL', 'RI'}
Unique categories in test: {'CT', 'MI', 'NH', 'NM', 'TX', 'HI', 'CA', 'IL', 'IA', 'NE', 'LA', 'WI', 'OR', 'GA', 'UT', 'ID', 'DE', 'SD', 'NC', 'PA', 'VT', 'OH', 'NY', 'CO', 'MD', 'AL', 'AR', 'TN', 'ME', 'MT', 'VA', 'NJ', 'OK', 'AZ', 'KS', 'NV', 'ND', 'IN', 'WA', 'MO', 'AK', 'SC', 'MN', 'MS', 'MA', 'KY', 'FL'}
Categories in test but not train: set()
Checking categorical column: business_new
Unique categories in train: {'New', 'Existing'}
Unique categories in test: {'New', 'Existing'}
Categories in test but not train: set()
Checking categorical column: business_type
Unique categories in train: {0, 1}
Unique categories in test: {0, 1}
Categories in test but not train: set()
Checking categorical column: location
Unique categories in train: {'Rural'}
Unique categories in test: {'Rural'}
Categories in test but not train: set()
Checking categorical column: other_loans
Unique categories in train: {'N', 'Y'}
Unique categories in test: {'N', 'Y'}
Categories in test but not train: set()
[]
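`check_categories` also comes from `utils` and isn't shown. Based on the output above (it prints the per-column category sets and returns the list of columns with unseen test categories), a plausible sketch:

```python
import pandas as pd

def check_categories(test_df, train_df, categorical_fields):
    # Return the columns whose test-set categories are not fully covered by train
    missing = []
    for col in categorical_fields:
        extra = set(test_df[col].unique()) - set(train_df[col].unique())
        if extra:
            missing.append(col)
    return missing

# Hypothetical example: 'TX' appears in test but not train
train_demo = pd.DataFrame({"state": ["CA", "NY"]})
test_demo = pd.DataFrame({"state": ["CA", "TX"]})
print(check_categories(test_demo, train_demo, ["state"]))  # ['state']
```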
All of the categorical values present in the test dataset can also be found in the train dataset. Additionally, the location field has only one unique value, 'Rural', in both the train and test datasets. Since every record shares the same location, the field has no discriminative power over default_status. Hence, let's drop the location field from the list of candidate features for the model.
train = train.drop(columns = ['location'])
test = test.drop(columns = ['location'])
train = convert_to_datetime(train)
train = convert_amt_cols_to_float(train, amt_cols = ['loan_amount', 'insured_amount'])
train.head(2)
loan_amount
<class 'str'>    2401
Name: count, dtype: int64
count    2.401000e+03
mean     2.045728e+05
std      3.643876e+05
min      3.000000e+03
25%      2.500000e+04
50%      5.000000e+04
75%      2.169900e+05
max      4.000000e+06
Name: loan_amount, dtype: float64
insured_amount
<class 'str'>    2401
Name: count, dtype: int64
count    2.401000e+03
mean     1.550500e+05
std      3.114833e+05
min      1.700000e+03
25%      1.275000e+04
50%      3.500000e+04
75%      1.250000e+05
max      4.000000e+06
Name: insured_amount, dtype: float64
|  | id | industry | state | request_date | term | employee_count | business_new | business_type | other_loans | loan_amount | insured_amount | default_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4050975007 | Others | VA | 2010-04-27 | 34 | 4 | New | 0 | N | 35000.0 | 35000.0 | 1 |
| 1 | 3735095001 | Manufacturing | CA | 2009-11-05 | 107 | 1 | New | 0 | N | 15000.0 | 13500.0 | 1 |
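`convert_to_datetime` and `convert_amt_cols_to_float` live in `utils` (not shown). Judging from the raw values ('20-Nov-09', '$35,000.00') and the converted output, plausible sketches of both:

```python
import pandas as pd

def convert_to_datetime(df, date_col="request_date"):
    # Parse dates like '20-Nov-09' into pandas Timestamps
    df = df.copy()
    df[date_col] = pd.to_datetime(df[date_col], format="%d-%b-%y")
    return df

def convert_amt_cols_to_float(df, amt_cols):
    # Strip '$' and thousands separators, then cast to float
    df = df.copy()
    for col in amt_cols:
        df[col] = (df[col].str.replace("$", "", regex=False)
                          .str.replace(",", "", regex=False)
                          .astype(float))
    return df

# Hypothetical single-row frame
demo = pd.DataFrame({
    "request_date": ["20-Nov-09"],
    "loan_amount": ["$35,000.00"],
    "insured_amount": ["$13,500.00"],
})
demo = convert_to_datetime(demo)
demo = convert_amt_cols_to_float(demo, ["loan_amount", "insured_amount"])
print(demo.dtypes)
```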
test = convert_to_datetime(test)
test = convert_amt_cols_to_float(test, amt_cols = ['loan_amount', 'insured_amount'])
test.head(2)
loan_amount
<class 'str'>    601
Name: count, dtype: int64
count    6.010000e+02
mean     1.885422e+05
std      3.085025e+05
min      2.000000e+03
25%      2.500000e+04
50%      5.194000e+04
75%      2.180000e+05
max      2.000000e+06
Name: loan_amount, dtype: float64
insured_amount
<class 'str'>    601
Name: count, dtype: int64
count    6.010000e+02
mean     1.469891e+05
std      2.720625e+05
min      1.000000e+03
25%      1.275000e+04
50%      3.500000e+04
75%      1.275000e+05
max      1.500000e+06
Name: insured_amount, dtype: float64
|  | id | industry | state | request_date | term | employee_count | business_new | business_type | other_loans | loan_amount | insured_amount |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3999155010 | Hotel | CA | 2010-03-26 | 91 | 1 | Existing | 1 | N | 270000.0 | 243000.0 |
| 1 | 4035035009 | Hotel | WA | 2010-04-19 | 124 | 0 | Existing | 0 | N | 443574.0 | 432000.0 |
train = create_loan_insured_features(train)
train.head(2)
|  | id | industry | state | request_date | term | employee_count | business_new | business_type | other_loans | loan_amount | insured_amount | default_status | loan_insured_amount_diff | insured_loan_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4050975007 | Others | VA | 2010-04-27 | 34 | 4 | New | 0 | N | 35000.0 | 35000.0 | 1 | 0.0 | 1.0 |
| 1 | 3735095001 | Manufacturing | CA | 2009-11-05 | 107 | 1 | New | 0 | N | 15000.0 | 13500.0 | 1 | 1500.0 | 0.9 |
test = create_loan_insured_features(test)
test.head(2)
|  | id | industry | state | request_date | term | employee_count | business_new | business_type | other_loans | loan_amount | insured_amount | loan_insured_amount_diff | insured_loan_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3999155010 | Hotel | CA | 2010-03-26 | 91 | 1 | Existing | 1 | N | 270000.0 | 243000.0 | 27000.0 | 0.900000 |
| 1 | 4035035009 | Hotel | WA | 2010-04-19 | 124 | 0 | Existing | 0 | N | 443574.0 | 432000.0 | 11574.0 | 0.973907 |
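`create_loan_insured_features` is another `utils` helper. Given the two derived columns above (loan_insured_amount_diff and insured_loan_ratio), a plausible sketch:

```python
import pandas as pd

def create_loan_insured_features(df):
    # Derived features: absolute gap between loan and insured amounts,
    # and the insured-to-loan ratio
    df = df.copy()
    df["loan_insured_amount_diff"] = df["loan_amount"] - df["insured_amount"]
    df["insured_loan_ratio"] = df["insured_amount"] / df["loan_amount"]
    return df

# Matches the second train row shown above: 15000 / 13500 -> 1500.0 and 0.9
demo = pd.DataFrame({"loan_amount": [15000.0], "insured_amount": [13500.0]})
demo = create_loan_insured_features(demo)
print(demo.loan_insured_amount_diff[0], demo.insured_loan_ratio[0])  # 1500.0 0.9
```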
train.request_date.describe()
count                             2401
mean     2010-03-15 00:34:47.130362368
min                2009-10-01 00:00:00
25%                2009-12-14 00:00:00
50%                2010-03-11 00:00:00
75%                2010-06-02 00:00:00
max                2010-09-30 00:00:00
Name: request_date, dtype: object
test.request_date.describe()
count                              601
mean     2010-03-11 00:38:20.166389248
min                2009-10-01 00:00:00
25%                2009-12-09 00:00:00
50%                2010-03-04 00:00:00
75%                2010-06-07 00:00:00
max                2010-09-30 00:00:00
Name: request_date, dtype: object
The request_date range is the same for both the train and test datasets, running from 2009-10-01 to 2010-09-30. In terms of request_date, it seems that both datasets were drawn from the same distribution.
Analyse the distribution of the target variable, default_status
plot_categories_distribution(train, 'default_status', width = 800, height = 400)
It could be observed that the train dataset is imbalanced in terms of the target variable, default_status. The number of companies that do not default (i.e. default_status = 0) is ~2x the number of companies that default (i.e. default_status = 1). In such a case, accuracy might not be the most appropriate performance metric, as it is biased towards the majority class (i.e. default_status = 0). Instead, F1 might be a more appropriate performance metric: it is the harmonic mean of precision and recall, and hence not biased towards either class.
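The metric argument can be checked numerically. A minimal sketch (assuming scikit-learn is available; the labels here are synthetic stand-ins for default_status) showing that accuracy flatters an "always predict majority" classifier while F1 exposes it:

```python
import pandas as pd
from sklearn.metrics import f1_score

# Hypothetical imbalanced labels: twice as many non-defaults as defaults
y_true = pd.Series([0] * 20 + [1] * 10)
y_pred = pd.Series([0] * 30)  # degenerate classifier that always predicts 0

print(y_true.value_counts(normalize=True))  # ~0.67 vs ~0.33
print((y_true == y_pred).mean())            # accuracy looks fine despite missing every default
print(f1_score(y_true, y_pred))             # F1 collapses to 0.0
```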
Analyse the probability of default given a categorical value
plot_category_default_distribution(train, 'industry')
The plot above suggests that there is a strong correlation between the industry and the default status. Some industries are significantly more prone to default as compared to others.
For example, it could be observed that the top 3 industries most likely to default are Construction, Hotel and Engineering, each having more than or equal to 0.375 probability of default. On the other hand, the Energy industry is the least likely to default, with only a 0.11 probability of default.
plot_category_default_distribution(train, 'state')
The plot above suggests that there is a strong correlation between the state and the default status. Some states are significantly more prone to default as compared to others.
For example, it could be observed that companies in the states DE, AZ, GA and AR are likely to default more than 50% of the time (i.e. their probability of default is greater than 0.5), whereas companies in the state CT are likely to default only 11% of the time (i.e. their probability of default is 0.11).
plot_category_default_distribution(train, 'business_new', height = 400)
The plot above suggests that there is no strong correlation between business_new and the default status, given that the probability of default for New and Existing businesses is approximately the same at 0.3; New and Existing businesses are equally likely to default.
plot_category_default_distribution(train, 'business_type', height = 400)
The plot above suggests a weak correlation between business_type and the default status, given that the probability of default for business_type = 0 (i.e. 0.33) is slightly higher than business_type = 1 (i.e. 0.27).
plot_category_default_distribution(train, 'other_loans', height = 400)
The plot above suggests a moderate correlation between other_loans and the default status: the probability of default for companies with no other loans (other_loans = 'N', i.e. 0.36) is moderately higher than for companies with other loans (other_loans = 'Y', i.e. 0.22).
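`plot_category_default_distribution` comes from `utils` (not shown). The per-category default rates it visualises can be reproduced with a simple groupby, sketched here on a hypothetical frame:

```python
import pandas as pd

# Hypothetical rows; the real call operates on the train DataFrame
demo = pd.DataFrame({
    "other_loans": ["N", "N", "N", "Y", "Y"],
    "default_status": [1, 0, 0, 1, 0],
})
# Probability of default within each category value:
# the mean of a 0/1 label is exactly P(default | category)
rates = demo.groupby("other_loans")["default_status"].mean()
print(rates)
```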
Analyse the correlation between numerical feature values and default status
px.box(train, x = 'term', color = 'default_status')
From the plot above, it could be observed that in general, companies tend to default on loans with shorter terms. The median loan term for default is 57 whereas the median loan term for non-default is 84.
px.box(train, x = 'employee_count', color = 'default_status')
From the plot above, it could be observed that in general, the companies that default have fewer employees (i.e. median 3) than companies that do not default (i.e. median 4).
px.box(train, x = 'loan_amount', color = 'default_status')
From the plot above, it could be observed that companies tend to default on smaller loan amounts. The median default loan amount is 33k, whereas the median non-default loan amount is 95.911k.
px.box(train, x = 'insured_amount', color = 'default_status')
From the plot above, it could be observed that companies tend to default when the insured amount is smaller. The median default insured amount is 22.5k, whereas the median non-default insured amount is 50k.
px.box(train, x = 'loan_insured_amount_diff', color = 'default_status')
From the plot above, it could be observed that companies tend to default when the difference between the loan amount and the insured amount, loan_insured_amount_diff, is smaller. The median default loan_insured_amount_diff is 2250, whereas the median non-default loan_insured_amount_diff is 24.75k.
px.box(train, x = 'insured_loan_ratio', color = 'default_status')
From the plot above, it could be observed that companies tend to default when the insured_loan_ratio is higher. The median default insured_loan_ratio is 0.9, whereas the median non-default insured_loan_ratio is 0.75.
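The medians read off these box plots can be checked directly with a groupby; a small sketch on synthetic rows standing in for the train data:

```python
import pandas as pd

# Hypothetical rows: term values for two defaulters and two non-defaulters
demo = pd.DataFrame({
    "term": [57, 60, 84, 90],
    "default_status": [1, 1, 0, 0],
})
# Median of a numerical feature within each class, the statistic
# summarised by the box plots above
medians = demo.groupby("default_status")["term"].median()
print(medians)
```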
Encode categorical features into numerical values
train, test, encoder_dict = encode_categorical_features(train, test)
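`encode_categorical_features` is defined in `utils`, which isn't shown. A plausible sketch using scikit-learn's LabelEncoder; the real helper presumably infers the categorical columns itself, whereas this sketch takes them explicitly:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_categorical_features(train, test, cat_cols):
    # Fit one encoder per column on train, apply it to both splits,
    # and keep the encoders for later inverse transforms
    train, test = train.copy(), test.copy()
    encoder_dict = {}
    for col in cat_cols:
        enc = LabelEncoder().fit(train[col])
        train[col] = enc.transform(train[col])
        test[col] = enc.transform(test[col])
        encoder_dict[col] = enc
    return train, test, encoder_dict

# Hypothetical example: classes are sorted, so 'N' -> 0, 'Y' -> 1
tr = pd.DataFrame({"other_loans": ["N", "Y"]})
te = pd.DataFrame({"other_loans": ["Y"]})
tr, te, encs = encode_categorical_features(tr, te, ["other_loans"])
print(tr.other_loans.tolist(), te.other_loans.tolist())  # [0, 1] [1]
```

Fitting only on train is safe here because the earlier check confirmed the test set contains no unseen categories.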
Perform Nested Stratified K-Fold Cross-Validation
We perform nested stratified k-fold cross-validation to estimate:
- The average model performance on the full train dataset
- The final number of boosting rounds required for training with the full train dataset
How this works
- We first stratify split our train dataset into N partitions, based on the default_status label.
- We keep 1 partition as the held-out test set and use the remaining N - 1 partitions for model training and validation.
- We then further stratify split the N - 1 partitions into another K partitions, again based on the default_status label, and perform K-fold cross-validation with them: we use K - 1 partitions for training and the remaining partition for hyperparameter tuning, repeating until every one of the K partitions has served as the tuning set. From this series of experiments, we estimate the optimal set of hyperparameters to use for training on the full N - 1 partitions.
- Following which, we train our model with the full N - 1 partitions dataset and the optimal set of hyperparameters.
- After that, we compute the performance of our model (i.e. trained on N - 1 partitions dataset) on the held-out test set.
- We repeat steps 2 - 5 iteratively until all N partitions have been used as the held-out test set.
- Finally, we compute the average test performance of the models based on all N held-out test sets.
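The steps above can be sketched with scikit-learn's StratifiedKFold. The LightGBM fitting is elided and replaced with stand-in values, so this only shows the nested splitting and the boosting-round scaling; X and y are placeholders for the feature matrix and default_status:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(100).reshape(-1, 1)   # placeholder feature matrix
y = np.array([0, 1] * 50)           # placeholder default_status labels

outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
held_out_sizes = []
for train_idx, test_idx in outer.split(X, y):
    inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    best_iters = []
    for fit_idx, val_idx in inner.split(X[train_idx], y[train_idx]):
        # fit LightGBM with early stopping on the val fold here and
        # record the early-stopped best iteration (stand-in value below)
        best_iters.append(48)
    # scale by num_folds / (num_folds - 1), as in the cell below
    scaled_rounds = int(np.mean(best_iters) * 5 / 4)
    # refit on the full outer-train split with scaled_rounds boosting rounds,
    # then score on X[test_idx]; here we only record the held-out fold size
    held_out_sizes.append(len(test_idx))
print(scaled_rounds, held_out_sizes)
```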
num_folds = 5
# Factor to scale num_boost_rounds by in lightgbm model training
# Factor is calculated based on the ratio of full dataset size / dataset size used for training during each fold of cross validation
scaling_factor = num_folds / (num_folds - 1)
features = ['industry', 'state', 'term', 'employee_count',
'business_new', 'business_type', 'other_loans', 'loan_amount',
'insured_amount', 'loan_insured_amount_diff', 'insured_loan_ratio']
cat_features = ['industry', 'state', 'business_new', 'business_type', 'other_loans']
# Perform nested stratified k fold cross validation
stratified_kfold_results = nested_stratified_kfold_cv(train, features, cat_features, num_folds = num_folds, scaling_factor = scaling_factor)
Outer fold 1: train shape (1920, 14), test shape (481, 14)
  Inner folds (best iteration / validation accuracy): 60 / 0.8958, 54 / 0.9219, 30 / 0.8620, 63 / 0.8828, 63 / 0.9010
  Average best iteration: 54.0 (scaled: 67)
  Test accuracy: 0.9231
Outer fold 2: train shape (1921, 14), test shape (480, 14)
  Inner folds: 44 / 0.8961, 50 / 0.9193, 48 / 0.8802, 53 / 0.8776, 43 / 0.8854
  Average best iteration: 47.6 (scaled: 59)
  Test accuracy: 0.8979
Outer fold 3: train shape (1921, 14), test shape (480, 14)
  Inner folds: 50 / 0.8805, 34 / 0.8698, 56 / 0.9062, 43 / 0.8828, 54 / 0.9089
  Average best iteration: 47.4 (scaled: 59)
  Test accuracy: 0.9146
Outer fold 4: train shape (1921, 14), test shape (480, 14)
  Inner folds: 36 / 0.9117, 34 / 0.8698, 54 / 0.8932, 61 / 0.8828, 57 / 0.8880
  Average best iteration: 48.4 (scaled: 60)
  Test accuracy: 0.9000
Outer fold 5: train shape (1921, 14), test shape (480, 14)
  Inner folds: 52 / 0.8883, 55 / 0.8880, 38 / 0.8490, 40 / 0.9167, … (output truncated)
(Repeated LightGBM [Info]/[Warning] messages about threading overhead, bin counts and initial scores omitted for brevity.)
[LightGBM] [Info] Total Bins 1253 [LightGBM] [Info] Number of data points in the train set: 1537, number of used features: 11 [LightGBM] [Info] [binary:BoostFromScore]: pavg=0.322707 -> initscore=-0.741361 [LightGBM] [Info] Start training from score -0.741361 Training until validation scores don't improve for 5 rounds Early stopping, best iteration is: [49] valid_0's binary_logloss: 0.251924 Best iteration: 49 Validation accuracy: 0.890625 Average best iteration: 46.8 Scaled average best iteration: 58 [LightGBM] [Info] Number of positive: 619, number of negative: 1302 [LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000142 seconds. You can set `force_col_wise=true` to remove the overhead. [LightGBM] [Info] Total Bins 1265 [LightGBM] [Info] Number of data points in the train set: 1921, number of used features: 11 [LightGBM] [Info] [binary:BoostFromScore]: pavg=0.322228 -> initscore=-0.743552 [LightGBM] [Info] Start training from score -0.743552 Test accuracy: 0.8916666666666667
Compute average model performance metrics on validation and test sets¶
avg_val_acc = np.mean(stratified_kfold_results['val_acc'])
avg_test_acc = np.mean(stratified_kfold_results['test_acc'])
avg_num_boost_rounds = int(np.mean(stratified_kfold_results['test_num_boost_rounds']))
print(f'Average validation model accuracy: {avg_val_acc}')
print(f'Average test model accuracy: {avg_test_acc}')
print(f'Average test num_boost_rounds: {avg_num_boost_rounds}')
Average validation model accuracy: 0.889939935064935 Average test model accuracy: 0.9054487179487181 Average test num_boost_rounds: 60
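The per-fold best iterations are scaled up before retraining because each inner training split contains only a fraction of the full dataset. A minimal sketch of the arithmetic, using the inner-fold best iterations of the last outer fold from the log above and assuming a `scaling_factor` of 1.25 (the actual value is defined earlier in the notebook):

```python
import numpy as np

# Best iterations of the five inner folds of the last outer fold (from the log above).
best_iterations = [52, 55, 38, 40, 49]

# Assumed value: if the inner train split is ~80% of the outer split,
# boosting rounds are scaled up by 1 / 0.8 = 1.25.
scaling_factor = 1.25

avg_best_iteration = np.mean(best_iterations)
scaled_best_iteration = int(avg_best_iteration * scaling_factor)

print(f'Average best iteration: {avg_best_iteration}')            # 46.8
print(f'Scaled average best iteration: {scaled_best_iteration}')  # 58
```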
Train model on full train dataset¶
model_path = './models/model_final.txt'
full_num_boost_rounds = int(np.mean(stratified_kfold_results['test_num_boost_rounds']) * scaling_factor)
print(f'Number of iterations used to train model on full dataset: {full_num_boost_rounds}')
final_model = train_lgb_model(
train = train,
num_boost_rounds = full_num_boost_rounds,
features = features,
cat_features = cat_features,
model_path = model_path,
)
Number of iterations used to train model on full dataset: 75
[LightGBM] [Info] Number of positive: 773, number of negative: 1628
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000246 seconds. You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1281
[LightGBM] [Info] Number of data points in the train set: 2401, number of used features: 11
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.321949 -> initscore=-0.744828
[LightGBM] [Info] Start training from score -0.744828
Run model inference on test dataset and save the inference results to a csv file.
generate_submissions(final_model, test, features = features)
test shape: (601, 13) Submissions shape: (601, 2) Saved submissions to submissions_addison.csv.
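`generate_submissions` is defined in `utils` and not shown here. A minimal sketch of what such a helper might look like, assuming the submission file holds an `id` column and a prediction column (the column names, print format and default file name are assumptions, not the actual implementation):

```python
import pandas as pd

def generate_submissions_sketch(model, test, features, out_path='submissions.csv'):
    # Score the test set with the trained model.
    scores = model.predict(test[features])
    # Assumed submission layout: one row per test record, id plus predicted score.
    submissions = pd.DataFrame({'id': test['id'], 'default_status': scores})
    print(f'test shape: {test.shape} Submissions shape: {submissions.shape}')
    submissions.to_csv(out_path, index=False)
    return submissions
```

Returning the DataFrame as well as saving it keeps the helper easy to inspect interactively in a notebook.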
Explain how different features impact model predictions with SHAP values¶
import shap
explainer = shap.Explainer(final_model.predict, test[features])
shap_values = explainer(test[features])
Permutation explainer: 602it [00:44, 11.13it/s]
Overall Feature Importance¶
max_display = 11
shap.plots.bar(shap_values, max_display = max_display)
From the bar plot above, the top five most influential features, in descending order, are:
- term
- loan_insured_amount_diff
- state
- insured_amount
- industry
The two least influential features, business_new and business_type, have minimal to no effect on the model's predictions.
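The bar plot ranks features by their mean absolute SHAP value across all observations. The same ranking can be computed directly from a SHAP value matrix; a small sketch on illustrative numbers (not taken from this model):

```python
import numpy as np

# Toy SHAP matrix: rows are observations, columns are features (illustrative values only).
feature_names = ['term', 'loan_insured_amount_diff', 'business_new']
toy_shap_values = np.array([
    [ 0.8, -0.3,  0.01],
    [-0.6,  0.4, -0.02],
    [ 0.7, -0.2,  0.00],
])

# Global importance is the mean absolute SHAP value per feature, ranked descending.
importance = np.abs(toy_shap_values).mean(axis=0)
ranking = [feature_names[i] for i in np.argsort(importance)[::-1]]
print(ranking)  # ['term', 'loan_insured_amount_diff', 'business_new']
```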
Effect of feature values on model predictions¶
shap.summary_plot(shap_values)
The summary plot above shows that, in general, lower values of term, loan_insured_amount_diff, insured_amount, employee_count and insured_loan_ratio push the model's prediction score up (i.e. increase the predicted probability of default), while higher values push it down (i.e. decrease the predicted probability of default).
Local contribution of feature values on model prediction for a single observation¶
Local contribution of feature values to the model prediction for a single observation with default_status = 1
idx = test[test.default_status == 1].index[0]
print(test.iloc[idx])
shap.plots.waterfall(shap_values[idx], max_display = max_display)
id 3794845001 industry 15 state 20 request_date 2009-12-04 00:00:00 term 29 employee_count 8 business_new 1 business_type 0 other_loans 1 loan_amount 340000.0 insured_amount 75000.0 loan_insured_amount_diff 265000.0 insured_loan_ratio 0.220588 model_score 0.571516 default_status 1 Name: 3, dtype: object
The waterfall plot above shows that the short term (29) and low insured_loan_ratio (0.221) heavily influence the model towards predicting that the company is highly likely to default. Conversely, the high loan_insured_amount_diff (265,000) pushes the model towards predicting that the company is less likely to default.
Local contribution of feature values to the model prediction for a single observation with default_status = 0
idx = test[test.default_status == 0].index[-1]
print(test.iloc[idx])
shap.plots.waterfall(shap_values[idx], max_display = max_display)
id 3901935000 industry 12 state 21 request_date 2010-02-11 00:00:00 term 84 employee_count 1 business_new 0 business_type 0 other_loans 1 loan_amount 66073.0 insured_amount 15000.0 loan_insured_amount_diff 51073.0 insured_loan_ratio 0.227022 model_score 0.013591 default_status 0 Name: 600, dtype: object
The waterfall plot above shows that the long term (84) and high loan_insured_amount_diff (51,073) heavily influence the model towards predicting that the company is very unlikely to default.